{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"标题\" style=\"width: 400px;\"/> </a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 通过 Numba 使用 CUDA 的多维网格和共享内存"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本节将介绍其他几个使用 Numba 在 CUDA 中编程的中级技巧。首先，我们将介绍 CUDA 如何支持创建二维和三维线程块及网格。凭借这种支持，我们在使用二维和三维数据进行编程时将更加容易。接下来，我们将介绍一种由程序员管理的片上内存空间（名为**共享内存**），该空间可用于同一个线程块内的线程之间的极速通信。您将有机会使用这两种技巧优化二维矩阵乘法的核函数。\n",
    "\n",
    "本节还提供带示例的附录，其中会讲解如何减少矩阵转置算法的**共享内存存储体冲突**，同时还附有资源链接，可供您进一步学习。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 目标\n",
    "\n",
    "完成本节必要部分的学习后，您将能够：\n",
    "\n",
    "* 使用多维线程块和网格，对多维数据集执行 GPU 加速并行工作。\n",
    "* 使用共享内存在片上缓存数据，减少缓慢的全局内存访问。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二维和三维线程块及网格\n",
    "\n",
    "网格和块均可分别进行配置，使之包含二维或三维线程块或线程的集合。多数情况下，此方法旨在为经常使用二维或三维数据集的程序员提供方便。下面举一个简单的例子，重点在于说明其采用的语法。您可能需要*同时*阅读核函数定义及其启动，才能理解该概念。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "A = np.zeros(16).reshape(4, 4).astype(np.int32)\n",
    "d_A = cuda.to_device(A)\n",
    "\n",
    "@cuda.jit\n",
    "def add_2D_coordinates(A):\n",
    "    # By passing `2`, we get the thread's unique x and y coordinates in the 2D grid\n",
    "    x, y = cuda.grid(2)    # here, x is the column index, y is the row index\n",
    "    \n",
    "    A[y][x] = x + y        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Here we create a 2D grid with 4 blocks in a 2x2 structure, each with 4 threads in a 2x2 structure\n",
    "# by using a Python tuple to signify grid and block dimensions.\n",
    "blocks = (2, 2)\n",
    "threads_per_block = (2, 2)\n",
    "\n",
    "add_2D_coordinates[blocks, threads_per_block](d_A)\n",
    "print(d_A.copy_to_host())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 关于练习的注解"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们要记住的重要一点是，若要 GPU 出色地运行，您就需给予 GPU 足够大的网格，使流多处理器 (SM) 在等待那些尚未完成任务的操作到期之前，始终有其它工作需要执行。明确该要点后，您还需注意，本节中部分练习会要求您编写不符合此最佳实践的核函数，这样做的目的是在让您有机会专注于在多个维度上工作时所涉及的语法和一些基本模式，并有一定的时间来适应。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 练习：在 GPU 上执行二维矩阵加法运算"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本练习需要您修改核函数定义及其启动配置，以实现二维矩阵加法并行运算。您可以先编写一个“质朴的”核函数，它仅在网格形状与数据形状相符的条件下被启动时才能正常运行。如您遇到问题，请随时参阅 [此解决方案](../edit/solutions/add_matrix_solution.py)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This function is written to add two 1D vectors. Refactor it to add 2D matrices\n",
    "@cuda.jit\n",
    "def add_matrix(A, B, C):\n",
    "    j, i = cuda.grid(2)\n",
    "    \n",
    "    C[i, j] = A[i, j] + B[i, j]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Do not modify the values in this cell, which defines 2D matrices of size 36*36\n",
    "A = np.arange(36*36).reshape(36, 36).astype(np.int32)\n",
    "B = A * 2\n",
    "C = np.zeros_like(A)\n",
    "d_A = cuda.to_device(A)\n",
    "d_B = cuda.to_device(B)\n",
    "d_C = cuda.to_device(C)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Refactor the launch configuration to use a 2D grid with 2D blocks\n",
    "# Refactor the launch configuration to use a 2D grid with 2D blocks\n",
    "blocks = 36\n",
    "threads_per_block = 36\n",
    "\n",
    "# This launch will throw a Typing error until refactor the definition above to operate on 2D arrays\n",
    "add_matrix[blocks, threads_per_block](d_A, d_B, d_C)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from numpy import testing\n",
    "output = d_C.copy_to_host()\n",
    "solution = A+B\n",
    "# This assertion will fail unles the output and solution are equal\n",
    "testing.assert_array_equal(output, solution)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 练习：在 GPU 上执行二维矩阵乘法运算"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本练习需要您完成核函数的代码，以计算二维 [矩阵乘法](https://en.wikipedia.org/wiki/Matrix_multiplication) 运算中的一个元素。与刚刚编写的矩阵加法核函数类似，该核函数同样也是一个“质朴的”函数，因为它要求网格与做为参数传递进来的矩阵具有相同的维度。如您遇到问题，请参阅 [此解决方案](../edit/solutions/matrix_multiply_solution.py)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda\n",
    "\n",
    "@cuda.jit\n",
    "def mm(a, b, c):\n",
    "    row, column = cuda.grid(2)\n",
    "    sum = 0\n",
    "    \n",
    "    ###\n",
    "    # TODO: Build the rest of this kernel to calculate the value for one element in the output matrix.\n",
    "    ###\n",
    "        \n",
    "    c[row][column] = sum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Do not modify the values in this cell\n",
    "a = np.arange(16).reshape(4,4).astype(np.int32)\n",
    "b = np.arange(16).reshape(4,4).astype(np.int32)\n",
    "c = np.zeros_like(a)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_b = cuda.to_device(b)\n",
    "d_c = cuda.to_device(c)\n",
    "\n",
    "grid = (2,2)\n",
    "block = (2,2)\n",
    "mm[grid, block](d_a, d_b, d_c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from numpy import testing\n",
    "solution = a@b\n",
    "output = d_c.copy_to_host()\n",
    "# This assertion will fail until you successfully implement the kernel\n",
    "testing.assert_array_equal(output, solution)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 多维中的步长"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "正如可以使用 Numba 的 `cuda.gridsize(1)` 获取网格中的线程总数，我们也能使用 `cuda.gridsize(2)` 获取二维网格每个方向上的线程总数。该功能非常实用。例如，如果二维数据集大于网格，我们会需要每个线程在循环中跨过网格，以便覆盖所有必要工作，此时该功能便可派上用场。\n",
    "\n",
    "正如一维跨网格循环，该技巧同样增加了网格和块尺寸的灵活性，且无需考虑数据的形状。以下示例展示了二维跨网格循环的使用效果，特别重要的是，最终打印的信息显示出网格中的哪个线程处理矩阵中的哪个元素。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda\n",
    "@cuda.jit\n",
    "def add_2D_coordinates_stride(A):\n",
    "\n",
    "    grid_y, grid_x = cuda.grid(2)\n",
    "    # By passing `2`, we get the grid size in both the x an y dimensions\n",
    "    stride_y, stride_x = cuda.gridsize(2)\n",
    "    \n",
    "    for data_i in range(grid_x, A.shape[0], stride_x):\n",
    "        for data_j in range(grid_y, A.shape[1], stride_y):\n",
    "            A[data_i][data_j] = grid_x + grid_y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此处采用 3x2 结构创建了二维网格，网格中包含 6 个块，且每个块均采用 3x2 结构并由 6 个线程组成。该网格不仅小于我们的总数据集，而且网格形状也不能均匀适配数据集维度。即便如此，核函数依然能够访问数据中的每个元素。运行此单元后，尝试像网格一样调整数据的形状。运行代码前，您还可以尝试预测输出矩阵的值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "A = np.zeros(55).reshape(11, 5).astype(np.int32)\n",
    "d_A = cuda.to_device(A)\n",
    "\n",
    "blocks = (3, 2)\n",
    "threads_per_block = (3, 2)\n",
    "\n",
    "# With this configuration, `stride_x` will be 9, and `stride_y` will be 4\n",
    "add_2D_coordinates_stride[blocks, threads_per_block](d_A)\n",
    "print(d_A.copy_to_host())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 练习：对大于网格尺寸的二维矩阵执行加法运算"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本练习需要您调整以上“质朴的”矩阵加法核函数，使其能够适于处理任意尺寸的数据集。如您遇到问题，请随时参阅 [此解决方案](../edit/solutions/add_matrix_stride_solution.py)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Currently this kernel will only work correctly when passed matrices that are of the same size as the grid.\n",
    "# Refactor using a strid in 2D so that it can work on data sets of an arbitrary size.\n",
    "@cuda.jit\n",
    "def add_matrix_stride(A, B, C):\n",
    "    j,i = cuda.grid(2)\n",
    "    \n",
    "    C[i,j] = A[i,j] + B[i,j]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Please don't modify the values in this cell. They create a scenario where the data is\n",
    "# larger than the grid size.\n",
    "A = np.arange(64*64).reshape(64, 64).astype(np.int32)\n",
    "B = A * 2\n",
    "C = np.zeros_like(A)\n",
    "d_A = cuda.to_device(A)\n",
    "d_B = cuda.to_device(B)\n",
    "d_C = cuda.to_device(C)\n",
    "\n",
    "blocks = (6,6)\n",
    "threads_per_block = (6,6)\n",
    "\n",
    "add_matrix_stride[blocks, threads_per_block](d_A, d_B, d_C)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from numpy import testing\n",
    "output = d_C.copy_to_host()\n",
    "solution = A+B\n",
    "# This assertion will fail unles the output and solution are equal\n",
    "testing.assert_array_equal(output, solution)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 练习：对大于网格尺寸的二维矩阵执行乘法运算"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本练习需要您完成矩阵乘法核函数，使其能够适用于任意网格和数据集形状。您只需编写两行包含 `TODO` 的代码，并使用 `grid_` 和 `stride_` 的值，将所执行的核函数中的工作正确映射到数据中。如您遇到问题，请参阅 [此解决方案](../edit/solutions/matrix_multiply_stride_solution.py)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda\n",
    "@cuda.jit\n",
    "def mm_stride(A, B, C):\n",
    "\n",
    "    grid_row, grid_column = cuda.grid(2)\n",
    "    stride_row, stride_column = cuda.gridsize(2)\n",
    "    \n",
    "    for data_row in range(0): # TODO: replace 0 with values that will correctly set data_row\n",
    "        for data_column in range(0): # TODO: replace 0 with values that will correctly set data_column\n",
    "            sum = 0\n",
    "            for i in range(A.shape[1]): # B.shape[0] would also be okay here\n",
    "                sum += A[data_row][i] * B[i][data_column]\n",
    "                \n",
    "            C[data_row][data_column] = sum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Please do not modify this cell. The strange dimensions of this data, and\n",
    "# the grid below are being set to make sure your kernel correctly handles arbitrary\n",
    "# data and grid sizes.\n",
    "\n",
    "a = np.arange(12).reshape(3,4).astype(np.int32)\n",
    "b = np.arange(24).reshape(4,6).astype(np.int32)\n",
    "c = np.zeros((a.shape[0], b.shape[1])).astype(np.int32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_b = cuda.to_device(b)\n",
    "d_c = cuda.to_device(c)\n",
    "\n",
    "ts = (4, 3)\n",
    "bs = (3, 7)\n",
    "\n",
    "mm_stride[bs, ts](d_a, d_b, d_c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from numpy import testing\n",
    "solution = a@b\n",
    "output = d_c.copy_to_host()\n",
    "# This assertion will fail until you correctly update the kernel above.\n",
    "testing.assert_array_equal(output, solution)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 共享内存\n",
    "\n",
    "目前为止，我们一直都在区分主机和设备内存，而设备内存就像某种单独的内存类型。但实际上，CUDA 拥有更为精细的[内存层次结构](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy)。我们至今一直使用的设备内存称为**全局内存**。设备上的任何线程或块都可使用该内存，其存续时间可贯穿应用程序的整个生命周期，且内存空间相对较大。\n",
    "\n",
    "作为最后一个话题，我们将探讨如何使用一种名为**共享内存**的片上设备内存区域。共享内存是由程序员定义的缓存，容量有限。此类内存 [取决于所使用的 GPU](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities)，且可为同一个线程块内所有线程**共享**。共享内存是一种稀缺资源。若线程位于分配有共享内存的线程块之外，则无法访问此共享内存，且此类内存在核函数执行完毕后即会被释放。然而，共享内存带宽远高于全局内存，可在许多核函数中发挥出色的使用效果，尤其是有助于优化性能。\n",
    "\n",
    "以下为共享内存的几个常见使用实例：\n",
    "\n",
    "* 缓存从全局内存中读取的内存，以便在线程块内多次读取。\n",
    "* 缓存线程输出，以便在重新写入全局内存之前对计算结果进行合并。\n",
    "* 暂存数据，用于线程块内的分发和收集（scatter/gather）操作。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 共享内存语法\n",
    "\n",
    "Numba 提供用于分配共享内存和在块内线程之间执行同步操作的[函数](https://numba.pydata.org/numba-doc/dev/cuda/memory.html#shared-memory-and-thread-synchronization)，且这些函数对于并行线程读取或写入共享内存后的操作往往必不可少。\n",
    "\n",
    "声明共享内存时，您需使用[Numba 类型](https://numba.pydata.org/numba-doc/dev/reference/types.html#numba-types)提供共享阵列的形状和类型。**阵列形状必须为常数**，因而您不可使用函数中传入的参数，也不得提供类似于 `numba.cuda.blockDim.x` 的变量或计算出的 `cuda.gridDim` 值。\n",
    "\n",
    "下面是一个复杂的示例，展示了分配共享内存所采用的语法，其中的注释指出了数据从主机内存、全局设备内存到共享内存，然后依次返回至全局设备内存和主机内存的循环过程："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import types, cuda\n",
    "\n",
    "@cuda.jit\n",
    "def swap_with_shared(x, y):\n",
    "    # Allocate a 4 element vector containing int32 values in shared memory.\n",
    "    temp = cuda.shared.array(4, dtype=types.int32)\n",
    "    \n",
    "    idx = cuda.grid(1)\n",
    "    \n",
    "    # Move an element from global memory into shared memory\n",
    "    temp[idx] = x[idx]\n",
    "    \n",
    "    # cuda.syncthreads will force all threads in the block to synchronize here, which is necessary because...\n",
    "    cuda.syncthreads()\n",
    "    #...the following operation is reading an element written to shared memory by another thread.\n",
    "    \n",
    "    # Move an element from shared memory back into global memory\n",
    "    y[idx] = temp[cuda.blockDim.x - cuda.threadIdx.x - 1] # swap elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "x = np.arange(4).astype(np.int32)\n",
    "y = np.zeros_like(x)\n",
    "\n",
    "# Move host memory to device (global) memory\n",
    "d_x = cuda.to_device(x)\n",
    "d_y = cuda.to_device(y)\n",
    "\n",
    "swap_with_shared[1, 4](d_x, d_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Move device (global) memory back to the host\n",
    "d_y.copy_to_host()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 共享内存示例：矩阵转置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为举例展示共享内存的能力，并进一步演示二维 CUDA 编程，下面就让我们编写一个矩阵转置核函数。具体方法是把以行为主序的二维阵列再以列为主序进行排列。（此示例基于 Mark Harris 的博文：[高效矩阵转置](https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/)）。\n",
    "\n",
    "在这个例子中，我们将使用共享内存作为一种机制，用来实现对全局内存进行合并读取和合并写入。对于转置算法来说，合并读/写通常是不可能的，但是由于共享内存的访问是如此之快，以至于我们可以对其进行“非合并”读取或写入而不会造成性能损失，以便实现随后的以合并的方式对全局内存进行读取和/或写入。该示例将证明这一点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 不使用共享内存的矩阵转置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在展示共享内存的实现之前，让我们先用一个简单的方法，让每个线程仅使用全局内存独立地读写各个矩阵元素。 这样，当您看到共享内存示例时，可以专注于共享内存如何影响算法，而不是算法本身的基础知识："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def transpose(a_in, a_out):\n",
    "    # Explicitly calculate indices rather than using cuda.grid(2)\n",
    "    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
    "    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
    "\n",
    "    a_out[row, col] = a_in[col, row]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "size = 16384\n",
    "a_in = cuda.to_device(np.arange(size*size, dtype=np.int32).reshape((size, size)))\n",
    "a_out = cuda.device_array_like(a_in)\n",
    "\n",
    "print(a_in.copy_to_host())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "threads_per_block = (32, 32)\n",
    "blocks_per_grid = (int(size/threads_per_block[0]), int(size/threads_per_block[1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit transpose[blocks_per_grid, threads_per_block](a_in, a_out); cuda.synchronize()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(a_out.copy_to_host())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 使用共享内存的矩阵转置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，让我们使用共享内存实现这个算法。为此：\n",
    "\n",
    "1. 每个线程块将创建一个32x32元素的共享内存阵列\n",
    "2. 每个线程块将从全局内存中的输入阵列合并读取到共享内存阵列中\n",
    "3. 在将值写回到全局内存之前，块中的每个线程将等待该块中的所有其他线程使用线程同步来完成其读取操作\n",
    "4. 我们从共享内存对全局内存进行合并写入，并以转置顺序进行写入。我们将分两步进行操作，首先是将共享内存阵列对应的在输入阵列中的位置进行转置（下面的4a），然后在将数据元素写回到全局内存之前（在下面的4b中）转置该元素在共享内存阵列中的位置。 \n",
    "\n",
    "这种方法使我们能够在全局内存中进行合并的读写操作，因为我们可以在共享内存空间内执行转置，而共享内存阵列中的非连续读写不会造成性能损失。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numba.types\n",
    "\n",
    "@cuda.jit\n",
    "def tile_transpose(a_in, a_out):\n",
    "    # `tile_transpose` assumes it is launched with a 32x32 block dimension,\n",
    "    # and that `a_in` is a multiple of these dimensions.\n",
    "    \n",
    "    # 1) Create 32x32 shared memory array.\n",
    "    tile = cuda.shared.array((32, 32), numba.types.int32)\n",
    "\n",
    "    # Compute offsets into global input array.\n",
    "    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
    "    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
    "    \n",
    "    # 2) Make coalesced read from global memory into shared memory array.\n",
    "    # Note the use of local thread indices for the shared memory write,\n",
    "    # and global offsets for global memory read.\n",
    "    tile[cuda.threadIdx.y, cuda.threadIdx.x] = a_in[col, row]\n",
    "\n",
    "    # 3) Wait for all threads in the block to finish updating shared memory.\n",
    "    cuda.syncthreads()\n",
    "    \n",
    "    # 4a) Calculate transposed location for the shared memory array tile\n",
    "    # to be written back to global memory...\n",
    "    row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n",
    "    col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n",
    "\n",
    "    # 4b) ...Write back to global memory,\n",
    "    # transposing each element within the shared memory array.\n",
    "    a_out[col, row] = tile[cuda.threadIdx.x, cuda.threadIdx.y]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a_out = cuda.device_array_like(a_in)\n",
    "\n",
    "%timeit tile_transpose[blocks_per_grid, threads_per_block](a_in, a_out); cuda.synchronize()\n",
    "print(a_out.copy_to_host())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在已经加速的代码之上获得这样的加速比相当不错了。 （博客文章[Efficient Matrix Transpose](https://devblogs.nvidia.com/parallelforall/efficiency-matrix-transpose-cuda-cc/)展示了一种实现更高速度的方法，该方法是使一个线程块中的每个线程负责一个矩阵块(Tile)中的多个行，从而降低了计算线程索引值的成本。）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 评估\n",
    "\n",
    "下面的练习将用到您目前所学的全部知识。不同于之前的练习，本次练习不提供任何解决方案代码，且您还需采取一些其他步骤来“运行评估”，以获得操作分数。**请仔细阅读说明后再开始工作，确保以最大机率成功完成本次评估。**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 如何运行评估\n",
    "\n",
    "请执行以下步骤完成评估：\n",
    "\n",
    "1. 按照以下说明，像平常练习一样运行下方单元。\n",
    "2. 若您对自己的执行效果甚感满意，请按照下方说明，将代码复制粘贴到所关联的源代码文件中。代码粘贴完成后，务必保存文件。\n",
    "3. 返回至您用来启动此笔记本的浏览器选项卡，然后点击**“Assess”（评估）**按钮。几秒后会生成分数，同时还将提供一条实用信息。\n",
    "\n",
    "您可视需要点击**“Assess”（评估）**按钮，次数不限。如果您首次未获通过，也不必担心，只需对代码作出其他修改并重复以上三个步骤，即可再次进行评估。祝您好运！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Run the assessment](images/run_assess_task.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用共享内存实现矩阵乘法\n",
    "\n",
    "本练习中，您将完成矩阵乘法核函数。该函数会使用共享内存来缓存输入矩阵中的值，以便仅需从全局内存对这些值执行一次访问。之后，在计算线程输出元素时即可使用这些缓存值。此项评估旨在测试您是否理解二维并行问题及使用共享内存。该问题不涉及大量运算，也未使用超大型数据集。因此，与简单的 CPU 程序相比，速度增加并不会特别明显。然而，一旦掌握所需技巧，您便能在真正需要加速某些涉及二维数据集的程序时，从容应对各类情况。\n",
    "\n",
    "为继续专注于共享内存，此问题假设输入向量为 MxN 和 NxM 维度，且每线程块包含 NxN 个线程，每个网格包含 M/N 个线程块。这表示，若共享内存所缓存的元素数量等于每个块的线程数，我们便足以从输入矩阵中提供运算所需的全部元素，且无需跨越网格。\n",
    "\n",
    "下图展示了输入矩阵、输出矩阵、在输出矩阵中将使用线程块计算输出值的区域、输入矩阵中该线程块将会缓存数据的区域，以及该块中单个线程所需的输入和输出的数据元素：\n",
    "\n",
    "![矩阵乘法图](images/mm_image.png)\n",
    "\n",
    "在核函数中已经分配好共享内存`a_cache`和`b_cache`做为缓存了。接下来，您需完成以下两项任务：\n",
    "1. 使用块中的每个线程为每个缓存填充一个数据元素。\n",
    "2. 在计算每个线程的`sum`值时使用共享内存。\n",
    "\n",
    "请务必执行任何可能需要的线程同步，以避免其他线程所写的缓存值尚不可用。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 提示"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**在编码之前仔细考虑一下空间挑战**\n",
    "\n",
    "这个问题所需的空间推理并不简单。 您将需要考虑线程和块的配置，共享内存读取和写入，以及所有需要转置的2D数据集的上下文。尽管这与CUDA编程语法没有直接关系，但它可以帮助您成为强大的并行编程程序员。在实际尝试编写代码之前，请考虑花时间在纸上画出问题和解决方案。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**处理GPU错误**\n",
    "\n",
    "对于像这样的具有挑战性的内存访问模式问题，通常会出错。 第2节附录中介绍了在GPU上挖掘错误的技术，但是您可能还没有时间消化它们，因此这里有几点可以帮助您。\n",
    "\n",
    "* 因为我们是在Jupyter环境中工作，如果您在CUDA代码中遇到错误，则可能需要重新启动Jupyter内核。您可以随时通过使用上面的*Kernel*菜单并选择*Restart*来执行此操作。它将清除本地内存，因此您需要在此之后重新运行导入和代码定义文件。\n",
    "* 如果收到“ UNKNOWN_CUDA_ERROR”，则可能是内存访问出现问题。重新启动Jupyter内核，并仔细考虑您的代码在何处进行超出范围的内存访问。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**可以处理多种输入大小的代码**\n",
    "\n",
    "在运行评估时，将根据几种不同的输入大小对您的代码进行评估，因此您将需要编写一个可以可靠地处理任意输入大小的内核。 在成功解决以下输入值之后，并在运行评估之前，建议您修改输入值（请参阅下面的单元格中的注释），以确保代码在更改输入大小时可以正常工作。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda, types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# You do not need to edit the values in this cell, however, before running the assessment\n",
    "# it is recommended you try your code against different input sizes to make sure it can\n",
    "# handle arbitrary input sizes.\n",
    "M = 128\n",
    "N = 32\n",
    "\n",
    "# Consider using this value for N before running the assessment\n",
    "# to make sure your code handles arbitrary input sizes.\n",
    "# N = 8\n",
    "\n",
    "# Input vectors of MxN and NxM dimensions\n",
    "a = np.arange(M*N).reshape(M,N).astype(np.int32)\n",
    "b = np.arange(M*N).reshape(N,M).astype(np.int32)\n",
    "c = np.zeros((M, M)).astype(np.int32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_b = cuda.to_device(b)\n",
    "d_c = cuda.to_device(c)\n",
    "\n",
    "# NxN threads per block, in 2 dimensions\n",
    "block_size = (N,N)\n",
    "# MxM/NxN blocks per grid, in 2 dimensions\n",
    "grid_size = (int(M/N),int(M/N))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在对以下单元中的 `mm_shared` 作出任何修改后，请将此单元的内容粘贴至 [**`assessment/definition.py`**](../edit/assessment/definition.py)，之后再运行评估。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda, types\n",
    "@cuda.jit\n",
    "def mm_shared(a, b, c):\n",
    "    column, row = cuda.grid(2)\n",
    "    sum = 0\n",
    "\n",
    "    # `a_cache` and `b_cache` are already correctly defined\n",
    "    a_cache = cuda.shared.array(block_size, types.int32)\n",
    "    b_cache = cuda.shared.array(block_size, types.int32)\n",
    "\n",
    "    # TODO: use each thread to populate one element each a_cache and b_cache\n",
    "    \n",
    "    for i in range(a.shape[1]):\n",
    "        # TODO: calculate the `sum` value correctly using values from the cache \n",
    "        sum += a_cache[0][0] * b_cache[0][0]\n",
    "        \n",
    "    c[row][column] = sum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# There's no need to update this kernel launch\n",
    "mm_shared[grid_size, block_size](d_a, d_b, d_c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Do not modify the contents in this cell\n",
    "from numpy import testing\n",
    "solution = a@b\n",
    "output = d_c.copy_to_host()\n",
    "# This assertion will fail until you correctly update the kernel above.\n",
    "testing.assert_array_equal(output, solution)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 总结\n",
    "\n",
    "您已完成本节学习，现已能够：\n",
    "\n",
    "* 使用多维线程块和网格，对多维数据集执行 GPU 加速并行工作。\n",
    "* 使用共享内存在片上缓存数据，同时减少缓慢的全局内存访问。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 下载内容\n",
    "\n",
    "如要下载此笔记本的内容，请执行以下单元，然后点击下方的下载链接。注意：由于笔记本中的部分文件路径链接是专为我们的平台量身设计，若您在本地 Jupyter 服务器上运行此笔记本，这些链接可能会遭到损坏。不过，您仍可通过 Jupyter 文件导航器导航至这些文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!tar -zcvf section3.tar.gz ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[下载本节文件。](files/section3.tar.gz)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 附录：无存储块冲突的矩阵转置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "共享内存分成多个**存储块**。共享内存一共有 32 个存储块，且不使用同一存储块的内存读写可以同时运行。当并行线程尝试访问同一存储块内的内存时，我们将这种情况称为**存储块冲突**，该冲突将导致操作的顺序化。即便出现存储块冲突，您也可十分快速地访问共享内存；但若创建能够避免存储块冲突的内存访问模式，您也可进一步优化应用程序。\n",
    "\n",
    "我们在此仅举一例，详情请参阅博文：[高效矩阵转置](https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from numba import cuda, types\n",
    "import numpy as np\n",
    "\n",
    "TILE_DIM = 32\n",
    "BLOCK_ROWS = 8\n",
    "TILE_DIM_PADDED = TILE_DIM + 1  # Read Mark Harris' blog post to find out why this improves performance!\n",
    "                                # https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/\n",
    "\n",
    "@cuda.jit\n",
    "def tile_transpose_no_bank_conflict(a_in, a_out):\n",
    "    # THIS CODE ASSUMES IT IS RUNNING WITH A BLOCK DIMENSION OF (TILE_SIZE x TILE_SIZE)\n",
    "    # AND INPUT IS A MULTIPLE OF TILE_SIZE DIMENSIONSx\n",
    "    tile = cuda.shared.array((TILE_DIM, TILE_DIM_PADDED), types.int32)\n",
    "\n",
    "    x = cuda.blockIdx.x * TILE_DIM + cuda.threadIdx.x\n",
    "    y = cuda.blockIdx.y * TILE_DIM + cuda.threadIdx.y\n",
    "    \n",
    "    for j in range(0, TILE_DIM, BLOCK_ROWS):\n",
    "        tile[cuda.threadIdx.y + j, cuda.threadIdx.x] = a_in[y + j, x] # move tile into shared memory\n",
    "\n",
    "    cuda.syncthreads()  # wait for all threads in the block to finish updating shared memory\n",
    "\n",
    "    # Compute transposed offsets\n",
    "    x = cuda.blockIdx.y * TILE_DIM + cuda.threadIdx.x\n",
    "    y = cuda.blockIdx.x * TILE_DIM + cuda.threadIdx.y\n",
    "\n",
    "    for j in range(0, TILE_DIM, BLOCK_ROWS):\n",
    "        a_out[y + j, x] = tile[cuda.threadIdx.x, cuda.threadIdx.y + j];\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "size = 8192\n",
    "a_in = cuda.to_device(np.arange(size*size, dtype=np.int32).reshape((size, size)))\n",
    "a_out = cuda.device_array_like(a_in)\n",
    "\n",
    "print(a_in.copy_to_host())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "grid_shape = (int(size/TILE_DIM), int(size/TILE_DIM))\n",
    "\n",
    "%timeit tile_transpose_no_bank_conflict[grid_shape,(TILE_DIM, BLOCK_ROWS)](a_in, a_out); cuda.synchronize()\n",
    "print(a_out.copy_to_host())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"标题\" style=\"width: 400px;\"/> </a>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
